ABSTRACT

The  emergence  of  the  magic  number  2  in  recent  statistical  literature  is 
explained  by  adopting  the  predictive  point  of  view  of  statistics  with  entropy 
as  the  basic  criterion  of  the  goodness  of  a  fitted  model.  The  historical 
development  of  the  concept  of  entropy  is  reviewed  and  its  relation  to 
statistics  is  explained  by  examples.  The  importance  of  the  entropy 
maximization  principle  as  the  basis  of  the  unification  of  conventional  and 
Bayesian  statistics  is  discussed. 


SIGNIFICANCE AND EXPLANATION


This  work  is  concerned  with  the  clarification  of  the  importance  of  the 
roles  played  by  the  predictive  point  of  view  and  the  concept  of  entropy  in 
statistics.  It  starts  with  the  discussion  of  the  appearance  of  a  common 
constant,  the  so  called  magic  number  2,  in  various  applications  of  statistics, 
and  shows  that  it  is  deeply  related  to  the  predictive  use  of  statistics. 

The  historical  development  of  the  statistical  concept  of  entropy  by 
L.  Boltzmann  is  reviewed  and  the  common  confusion  caused  by  the  adoption  of 
the  Shannon  entropy  is  eliminated.  The  close  relation  between  statistics  and 
the  Boltzmann  entropy  is  illustrated  by  examples. 

The  objectivity  of  the  log  likelihood  as  the  criterion  of  fit  is 
explained  with  the  aid  of  the  entropy.  The  generality  of  the  magic  number  2 
is  then  demonstrated  in  its  relation  to  an  information  criterion  AIC  which  is 
realized  by  combining  the  predictive  point  of  view  and  the  concept  of  entropy. 

The  discussion  leads  to  the  entropy  maximization  principle  which 
specifies  the  object  of  statistics  as  the  maximization  of  the  expected  entropy 
of  a  fitted  predictive  distribution.  It  is  shown  that  this  principle  provides 
a  basis  for  the  unification  of  conventional  and  Bayesian  statistics. 

Obviously  the  recognition  of  such  possibility  contributes  significantly  to  the 
enhancement  of  research  activity  in  the  general  area  of  statistics. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC  and  not  with  the  author  of  this  report. 


PREDICTION  AND  ENTROPY 


Hirotugu  Akaike* 


1.  Introduction  and  summary 

In  this  paper  we  start  with  an  observation  that  the  emergence  of  a  particular 
constant, the magic number 2, in several statistical papers is inherently related to the
predictive  use  of  statistics.  The  generality  of  the  constant  can  only  be  appreciated  when 
we  adopt  the  statistical  concept  of  entropy,  originally  developed  by  a  physicist 
L.  Boltzmann,  as  the  criterion  to  measure  the  deviation  of  a  distribution  from  another. 

A  historical  review  of  Boltzmann's  work  on  entropy  is  given  to  provide  a  basis  for  the 
interpretation  of  the  statistical  entropy.  The  neg-entropy,  or  the  negative  of  the 
entropy,  is  often  equated  to  the  amount  of  information.  This  review  clarifies  the 
limitation  of  Shannon's  definition  of  the  entropy  of  a  probability  distribution.  The 
relation  between  the  Boltzmann  entropy  and  the  asymptotic  theory  of  statistics  is  discussed 
briefly. 

The concept of entropy provides a proof of the objectivity of the log likelihood as a
measure of the goodness of a statistical model. It is shown that this observation,
combined with the predictive point of view, provides a simple explanation of the generality
of the magic number 2. This is done through the explanation of the AIC statistic
introduced by the present author. The use of AIC is illustrated by its application to
multidimensional contingency table analysis.

The  discussion  of  AIC  naturally  leads  to  the  entropy  maximization  principle  which 
specifies  the  object  of  statistics  as  the  maximization  of  the  expected  entropy  of  a  true 
distribution  with  respect  to  the  fitted  predictive  distribution.  The  generality  of  this 
principle  is  demonstrated  through  its  application  to  Bayesian  statistics.  The  necessity  of 
Bayesian modeling is discussed and its similarity to the construction of the statistical
model  of  thermodynamics  by  Boltzmann  is  pointed  out.  The  principle  provides  a  basis  for 
the  unification  of  the  Bayesian  and  conventional  statistics.  Referring  to  Boltzmann's 
fundamental  contribution  to  statistics,  the  paper  concludes  by  emphasizing  the  importance 
of  the  research  on  real  problems  for  the  development  of  statistics. 


2.  Emergence  of  the  magic  number  2 

Around 1970 a curious emergence of a constant was observed in a
series of papers. This is the emergence of what Stone (1977a) symbolically calls the magic
number 2.

The number appears in Mallows' $C_p$ statistic for the selection of independent
variables in multiple regression, which is by definition

$$ C_p = \frac{1}{s^2} RSS_p - n + 2p , $$

where $RSS_p$ denotes the residual sum of squares after regression on $p$ independent
variables, $n$ the sample size and $s^2$ an estimate of the common variance $\sigma^2$ of the
error terms (Mallows, 1973). The final prediction error (FPE) introduced by Akaike (1969,
1970) for the determination of the order of an autoregression is an estimate of the mean
squared error of the one-step prediction when the fitted model is used for prediction. It
satisfies asymptotically the relation

$$ n \log FPE \approx n \log S_p + 2p , $$

where $n$ denotes the length of the time series and $S_p$ the maximum likelihood estimate of
the innovation variance obtained by fitting the $p$th order autoregression. Both Leonard and
Ord (1976) and Stone (1977a) noticed the emergence of the number as the asymptotic critical
level of F-tests when the number of observations is increased.
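
The two statistics above can be compared numerically. The following is a minimal sketch, not taken from the paper: the residual sums of squares are hypothetical numbers, the function names are illustrative, and the closed form FPE = S_p (n + p)/(n - p) is assumed here as the definition of the final prediction error for a zero-mean autoregression.

```python
import math


def mallows_cp(rss_p, s2, n, p):
    """Mallows' C_p = RSS_p / s^2 - n + 2p."""
    return rss_p / s2 - n + 2 * p


def fpe(s_p, n, p):
    """Assumed form of the final prediction error: S_p * (n + p) / (n - p)."""
    return s_p * (n + p) / (n - p)


def n_log_fpe_gap(n, p):
    """n log FPE - n log S_p, which should be close to 2p when p is small relative to n."""
    return n * math.log(fpe(1.0, n, p))


# Hypothetical residual sums of squares for models with p = 1, ..., 4 regressors.
n, s2 = 50, 1.2
rss = {1: 90.0, 2: 61.0, 3: 59.5, 4: 59.0}
cp = {p: mallows_cp(r, s2, n, p) for p, r in rss.items()}
best_p = min(cp, key=cp.get)
```

In this made-up example the drop in RSS from p = 2 to p = 3 is too small to pay for the extra penalty of 2, so C_p is smallest at p = 2.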

An explanation of this number 2 can easily be given for the case of the multiple
regression analysis. The effect of regression is usually evaluated by the value of $RSS_p$.
A smaller $RSS_p$ may be obtained by increasing the number of independent variables $p$.
However, we know that after adding a certain number of independent variables further
addition of variables often merely increases the expected variability of the estimate.
When the increase of the expected variability is measured in terms of the mean squared
prediction error, it will be seen that the increase is exactly equal to the expected amount
of decrease of the sample residual variance $RSS_p/n$. Thus to convert $RSS_p$ into an
unbiased estimate of the mean squared error of prediction we must apply twice the
correction that is required to convert $RSS_p$ into an unbiased estimate of $n\sigma^2$.
With $E\,RSS_p = (n - p)\sigma^2$ under a correctly specified model, the latter correction is
$p\sigma^2$ and the former $2p\sigma^2$, which gives the term $2p$ in $C_p$.
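
A small simulation illustrates the argument. The sketch below is only an illustration under strong assumptions: the true regression function is zero, the errors are independent normal, and the design is deliberately trivial (p indicator regressors, one for each of the first p observations), so that the least squares fit can be written down without matrix algebra. The function name and defaults are hypothetical.

```python
import random


def simulate(n=20, p=5, sigma=1.0, reps=4000, seed=0):
    """Return the Monte Carlo means of RSS_p and of the summed squared prediction error."""
    rng = random.Random(seed)
    rss_sum = 0.0
    pred_sum = 0.0
    for _ in range(reps):
        y = [rng.gauss(0.0, sigma) for _ in range(n)]       # data used for fitting
        y_new = [rng.gauss(0.0, sigma) for _ in range(n)]   # future observations
        # With indicator regressors the fit reproduces y[i] for i < p and is 0 elsewhere.
        fitted = [y[i] if i < p else 0.0 for i in range(n)]
        rss_sum += sum((y[i] - fitted[i]) ** 2 for i in range(n))
        pred_sum += sum((y_new[i] - fitted[i]) ** 2 for i in range(n))
    return rss_sum / reps, pred_sum / reps


mean_rss, mean_pred = simulate()
```

With n = 20, p = 5 and unit variance, the mean of RSS_p should be near (n - p) = 15 and the mean prediction error near (n + p) = 25, so the gap between them is about 2p.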

The appearance of the critical value 2 for the F-test discussed by Leonard and Ord
(1976) is more instructive. The F-test is considered as a preliminary test of significance
in the estimation of the one-way ANOVA model where $K$ independent observations $y_{jk}$
($k = 1, 2, \ldots, K$) are taken from each group $j$ ($j = 1, 2, \ldots, J$). Under the assumption
that $y_{jk}$ are distributed as normal with mean $\theta_j$ and variance $\sigma^2$ the F-statistic for
testing the hypothesis $\theta_1 = \theta_2 = \cdots = \theta_J$ is given by

$$ F = \frac{(J - 1)^{-1} S_B}{(J(K - 1))^{-1} S_W} , $$

where $S_B = K \sum_j (y_{j\cdot} - y_{\cdot\cdot})^2$ and $S_W = \sum_j \sum_k (y_{jk} - y_{j\cdot})^2$, and where $y_{j\cdot}$ and $y_{\cdot\cdot}$ denote
the mean of the $j$th group and the grand mean, respectively. The final estimate of $\theta_j$ is
defined by

$$ \hat\theta_j = \begin{cases} y_{j\cdot} & \text{if the hypothesis is rejected} \\ y_{\cdot\cdot} & \text{otherwise.} \end{cases} $$

Consider the loss function $L(\hat\theta, \theta) = \sum_j (\hat\theta_j - \theta_j)^2$. For the simpler estimates defined
by $\hat\theta_j = y_{\cdot\cdot}$ and $\hat\theta_j = y_{j\cdot}$ it can easily be shown that the difference of the risks of
these estimates has one and the same sign as that of $E\,(J(K - 1))^{-1} S_W (F - 2)$. Thus when
the sample size $K$ is sufficiently large the choice of the critical value 2 for the F-test
to select $\hat\theta_j$ is appropriate.
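
The sign relation can be checked from the exact expectations. The sketch below is an illustration, not part of the original argument: it assumes the standard results E S_B = (J - 1)σ² + K Σ(θ_j - θ̄)² and E S_W = J(K - 1)σ² for the one-way normal model, and the function names are hypothetical.

```python
def risk_difference(theta, sigma2, K):
    """Risk of the grand-mean estimate minus risk of the group-mean estimate."""
    J = len(theta)
    mean = sum(theta) / J
    spread = sum((t - mean) ** 2 for t in theta)
    risk_grand = spread + sigma2 / K   # squared bias plus J times the variance sigma2/(J K)
    risk_group = J * sigma2 / K        # J group means, each with variance sigma2/K
    return risk_grand - risk_group


def expected_s2_times_f_minus_2(theta, sigma2, K):
    """E[(J(K-1))^{-1} S_W (F - 2)] = E[S_B] / (J - 1) - 2 sigma2."""
    J = len(theta)
    mean = sum(theta) / J
    spread = sum((t - mean) ** 2 for t in theta)
    expected_sb = (J - 1) * sigma2 + K * spread
    return expected_sb / (J - 1) - 2.0 * sigma2
```

Under these assumptions the second quantity equals K/(J - 1) times the first, so the two always share the same sign: the grand mean is the better estimate exactly when the expected value of F - 2, weighted by the within-group variance estimate, is negative.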

The characteristic that is common to these papers is that the authors considered some
predictive use of the models. An early example of the use of the concept of future
observation to clarify the structure of an inference procedure is Fisher (1935, p. 393).
The concept is explicitly adopted as the philosophical motivation in a work by Guttman
(1967). In the present paper, the point of view that considers the purpose of statistics
as the realization of appropriate predictions will generally be called the predictive point
of view.


In the above example of the ANOVA model, if the number of groups J is increased
indefinitely,  the  test  statistic  F  converges  to  1  under  the  null  hypothesis.  Thus  the 
critical  value  of  the  F-test  for  any  fixed  level  of  significance  must  also  converge  to  1 
instead  of  2.  As  is  observed  by  Leonard  and  Ord  this  dramatically  demonstrates  the 
difference  between  the  conventional  approach  to  model  selection  by  testing  with  a  fixed 
level  of  significance  and  the  predictive  approach.  Since  there  is  no  generally  accepted 
criterion  for  the  selection  of  the  level  of  significance  the  present  result  must  be 
considered  as  a  warning  against  the  conventional  testing  procedure.  Thus  the  emergence  of 
the  magic  number  2  must  be  considered  as  a  sign  of  the  impending  change  of  the  paradigm  of 
statistics.  However,  to  fully  appreciate  the  generality  of  the  number,  we  have  first  to 
expand  our  view  of  the  statistical  estimation  procedure. 


3. From point to distribution

The  risk  functions  considered  in  the  preceding  section  were  the  mean  squared  errors  of 
the  predictions.  Such  a  choice  of  the  criterion  is  conventional  but  quite  arbitrary.  The 
weakness  of  the  ad  hoc  definition  becomes  apparent  when  we  try  to  extend  the  concept  to 
multivariate  problems. 

A  typical  example  of  multivariate  analysis  is  factor  analysis.  At  first  sight  it  is 
not  at  all  clear  how  the  analysis  is  related  to  prediction.  In  1971,  trying  to  extend  the 
concept of FPE to solve the problem of determination of the number of factors, the
present  author  came  to  the  recognition  that  in  factor  analysis  our  prediction  was  realized 
through  the  specification  of  a  distribution  (Akaike,  1981).  This  observation  quickly  led 
to  the  observation  that  almost  all  the  important  statistical  procedures  are  concerned  with 
the  realization  of  predictions  through  the  specification  of  some  distributions. 

Stigler  (1975)  noticed  the  shift  of  the  interest  of  statisticians  from  point  to 
distribution  estimation  towards  the  end  of  the  19th  century.  However,  it  seems  that 
Fisher's  very  effective  use  of  the  concept  of  parameter  drew  the  attention  of  statisticians 
back  to  the  estimation  of  a  point  in  a  parameter  space.  We  are  now  in  a  position  to  return 
to  distributions  and  here  the  basic  problem  is  the  introduction  of  a  natural  topology  in 
the  space  of  distributions.  The  probabilistic  interpretation  of  thermodynamic  entropy 
developed  by  Boltzmann  provides  a  historically  most  successful  example  of  a  solution  to 
this  problem. 


4. Entropy and information

The statistical interpretation of the thermodynamic entropy, a measure of the
unavailable energy within a thermodynamic system, was developed in a series of papers by
L. Boltzmann in the 1870's. His first contribution was the observation of the monotone
decreasing behavior in time of a quantity defined by

$$ E = \int_0^\infty f(x,t) \log\left[\frac{f(x,t)}{\sqrt{x}}\right] dx , $$

where $f(x,t)$ denotes the frequency distribution of the number of molecules with energy
between $x$ and $x + dx$ at time $t$ (Boltzmann, 1872). Boltzmann showed that for a closed
system, under proper assumptions of the collision process of the molecules, the quantity $E$
can only decrease. When the distribution $f$ is defined with the velocities and positions
of the molecules the above quantity takes the form

$$ E = \iint f \log f \, dx \, d\xi , $$

where $x$ and $\xi$ denote the vectors of the position and velocity, respectively. Boltzmann
showed that for some gases this quantity, multiplied by a negative constant, was identical
to the thermodynamic entropy.

The  negative  of  the  above  quantity  was  adopted  by  C.  E.  Shannon  as  the  definition  of 
the  entropy  of  a  probability  distribution 

$$ H = -\int p(x) \log p(x) \, dx , $$

where  p(x)  denotes  the  probability  density  with  respect  to  the  measure  dx  (Shannon  and 
Weaver,  1949). 

Almost  uncountably  many  papers  and  books  have  been  written  about  the  use  of  the 
Shannon  entropy,  where  the  quantity  H  is  simply  referred  to  as  a  measure  of  information, 
or  uncertainty,  or  randomness.  One  departure  from  this  definition  of  entropy  is  known  as 
the Kullback-Leibler information (Kullback and Leibler, 1951) which is defined by

$$ I(q;p) = \int q(x) \log\left(\frac{q(x)}{p(x)}\right) dx $$

and relates the distribution $q(x)$ to another distribution $p(x)$. Kullback (1959, p. 6)
called this quantity the mean information per observation from $q(x)$ for discrimination in
favor of $q(x)$ against $p(x)$.

Much interest has been shown in the use of these quantities as measures of statistical
information.  However,  it  seems  that  the  potential  of  these  quantities  as  statistical 
concepts  has  not  been  fully  evaluated.  Apparently  this  is  due  to  the  neglect  of 
Boltzmann's  original  work  on  the  probabilistic  interpretation  of  thermodynamic  entropy. 

Karl  Pearson  (1929,  p.  205)  cites  the  words  of  D.  F.  Gregory  "...  we  sacrifice  many  of  the 
advantages  and  more  of  the  pleasure  of  studying  any  science  by  omitting  all  reference  to 
the  history  of  its  progress."  It  seems  that  this  has  been  precisely  the  case  with  the 
development  of  the  statistical  concept  of  entropy  or  information. 


5.  Distribution  and  entropy 

The work of Boltzmann (1872) produced a demonstration of the second law of
thermodynamics,  the  irreversible  increase  of  entropy  in  an  isolated  closed  system.  In 
answering  the  criticism  that  the  proof  of  irreversibility  is  based  on  the  assumption  of  a 
reversible  mechanical  process  Boltzmann  (1877a)  pointed  out  the  necessity  of  probabilistic 
interpretation  of  the  result. 

At  that  time  Meyer,  a  physicist,  produced  a  derivation  of  the  Maxwell  distribution  of 
the  kinetic  energy  among  gas  molecules  at  equilibrium  as  the  "most  probable"  distribution. 
Pointing  out  the  error  of  Meyer's  proof  Boltzmann  (1877b)  established  the  now  well-known 
identity 

    entropy = log probability of a statistical distribution.

His reasoning was based on the asymptotic equality

$$ \log \frac{n!}{n_0! \, n_1! \cdots n_p!} \approx -\sum_{i=0}^{p} n_i \log \frac{n_i}{n} , \qquad (1) $$

where $n_i$ denotes the frequency of the molecules at the $i$th energy level and
$n = n_0 + n_1 + \cdots + n_p$. If we put $p_i = n_i/n$ then the right hand side is equal to $nH(p)$,
where

$$ H(p) = -\sum_{i=0}^{p} p_i \log p_i , $$

which is the Shannon entropy of the distribution $p = (p_0, p_1, \ldots, p_p)$.
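
The asymptotic equality (1) is easy to examine numerically. The following sketch, with hypothetical function names, evaluates the exact log multinomial coefficient through the log-gamma function and compares it with nH(p); the two are only expected to agree in relative terms as the counts grow.

```python
import math


def log_multinomial_coefficient(counts):
    """Exact value of log( n! / (n_0! n_1! ... n_p!) )."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)


def shannon_entropy(probs):
    """H(p) = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def relative_gap(counts):
    """Relative difference between the exact log coefficient and n H(p)."""
    n = sum(counts)
    exact = log_multinomial_coefficient(counts)
    approx = n * shannon_entropy([c / n for c in counts])
    return abs(exact - approx) / approx
```

For three equally filled levels the relative gap shrinks from roughly ten percent at n = 30 to well under a tenth of a percent at n = 30000, which is the sense in which the log probability and n times the entropy coincide.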

Following  the  idea  that  the  frequency  distribution  f  of  molecules  at  a  thermal 
equilibrium  is  the  distribution  which  is  the  most  probable  under  the  assumption  of  a  given 
total  energy,  Boltzmann  maximized 


$$ H(f) = -\int_0^\infty f \log f \, dx $$

under the constraints

$$ \int_0^\infty f \, dx = N \quad \text{and} \quad \int_0^\infty x f(x) \, dx = L , $$

where $x$ denotes the energy level, $N$ the total number of molecules and $L$ the total
energy. The maximization produces as the energy distribution $f(x) = C \exp(-hx)$ with a
proper  positive  constant  h.  Boltzmann  discussed  in  great  detail  that  this  result  could  be 
physically  meaningful  only  for  a  proper  definition  of  the  energy  level  x,  a  point 
commonly  ignored  by  later  users  of  the  Shannon  entropy.  Incidentally  we  notice  here  an 
early  derivation  of  the  exponential  family  of  distributions  by  the  constrained  maximization 
of  H(f),  a  technique  of  probability  distribution  generation  later  called  by  the  name  of 
maximum  entropy  method  (Jaynes,  1957). 
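
The maximizing property of the exponential form can be illustrated by comparing it with other densities of the same mean. The sketch below is only a numerical illustration under assumptions not in the text: the distributions are normalized to total mass 1 (N = 1), the mean is set to 2, the two competitors (a gamma density with shape 2 and a uniform density) are arbitrary choices, and the integral is approximated by a midpoint rule on a truncated range.

```python
import math


def differential_entropy(density, upper=60.0, steps=60000):
    """Midpoint-rule approximation of -integral of f log f over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        f = density(x)
        if f > 0.0:
            total -= f * math.log(f) * h
    return total


MEAN = 2.0


def exponential(x):
    return math.exp(-x / MEAN) / MEAN


def gamma_shape2(x):
    # Gamma density with shape 2 and scale MEAN / 2, so that its mean is MEAN.
    return (4.0 * x / MEAN ** 2) * math.exp(-2.0 * x / MEAN)


def uniform(x):
    # Uniform density on [0, 2 * MEAN], which also has mean MEAN.
    return 1.0 / (2.0 * MEAN) if x <= 2.0 * MEAN else 0.0


h_exponential = differential_entropy(exponential)
h_gamma = differential_entropy(gamma_shape2)
h_uniform = differential_entropy(uniform)
```

The exponential density attains the largest value of H(f) among the three, in agreement with the constrained maximization. The comparison depends on measuring x on the same scale for all three densities, which echoes Boltzmann's point about the proper definition of the energy level.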

The change in Boltzmann's view of the energy distribution between 1872 and 1877 is
quite  significant.  In  the  1872  paper  the  distribution  f(x,t)  represented  a  unique 
entity.  In  the  1877b  paper  the  distribution  was  considered  as  a  random  sample  and  its 
probability  of  occurrence  was  the  main  subject. 

Boltzmann (1878) further extended the discussion of this point. Since the probability
of getting a sample frequency distribution $(w_0, w_1, \ldots, w_p)$ from a probability distribution
$(f_0, f_1, \ldots, f_p)$ is given by

$$ \Omega = f_0^{w_0} f_1^{w_1} \cdots f_p^{w_p} \, \frac{n!}{w_0! \, w_1! \cdots w_p!} , $$

Boltzmann obtained an asymptotic equality

$$ \ell\Omega = w_0 \ell f_0 + w_1 \ell f_1 + \cdots + w_p \ell f_p - w_0 \ell w_0 - w_1 \ell w_1 - \cdots - w_p \ell w_p + \text{const.} \qquad (2) $$

where $n = w_0 + w_1 + \cdots + w_p$ and $\ell$ denotes the natural logarithm. He pointed out that
the former formula (1) is a special case of (2) where it is assumed that
$f_0 = f_1 = \cdots = f_p$. Ignoring the additive constant the present formula (2) can be
rearranged in the form

$$ \ell\Omega = -n \sum_{i=0}^{p} g_i \, \ell\left(\frac{g_i}{f_i}\right) , $$

where $g_i = w_i/n$. Thus to retain the interpretation that the entropy is the log
probability of a distribution we have to adopt, instead of $H(p)$, the quantity

$$ B(g;f) = -\sum_i g_i \log\left(\frac{g_i}{f_i}\right) $$

as the definition of the entropy of the secondary distribution $g$ with respect to the
primary distribution $f$. When the distributions $g$ and $f$ are defined with densities
$g(x)$ and $f(x)$ the entropy is defined by

$$ B(g;f) = -\int g(x) \log\left(\frac{g(x)}{f(x)}\right) dx . $$

When it is necessary to distinguish this quantity from the thermodynamic entropy or the
Shannon entropy we will call it the Boltzmann entropy. It is now obvious that $B(g;f)$
provides a natural measure of deviation of $g$ from $f$.
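
The reading of the Boltzmann entropy as a log probability per observation can be checked directly. The following is a minimal sketch with hypothetical numbers and function names: it computes the exact log probability of a set of counts under a primary distribution f and compares it, per observation, with B(g;f) for the observed relative frequencies g.

```python
import math


def boltzmann_entropy(g, f):
    """B(g; f) = -sum_i g_i log(g_i / f_i); cells with g_i = 0 contribute nothing."""
    return -sum(gi * math.log(gi / fi) for gi, fi in zip(g, f) if gi > 0)


def log_multinomial_probability(w, f):
    """Exact log probability of the counts w under the cell probabilities f."""
    n = sum(w)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(wi + 1) for wi in w)
    return log_coef + sum(wi * math.log(fi) for wi, fi in zip(w, f) if wi > 0)


f = [0.5, 0.3, 0.2]        # hypothetical primary (hypothesized) distribution
w = [4000, 4000, 2000]     # hypothetical observed counts
n = sum(w)
g = [wi / n for wi in w]   # secondary (observed) distribution
per_observation = log_multinomial_probability(w, f) / n
entropy = boltzmann_entropy(g, f)
```

For these numbers the two values differ by about 0.001, a gap that comes from the lower order terms ignored in (2). B(g;f) is zero when g equals f and negative otherwise, so a frequency distribution far from f is exponentially improbable under f.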

The  equality  of  the  above  quantity  to  the  thermodynamic  entropy  holds  only  when  the 
former  is  maximized  under  the  assumption  of  a  given  mean  energy  for  an  appropriately  chosen 
"primary  distribution"  f  and  then  multiplied  by  a  proper  constant.  Thus  it  can  be  seen 
that the Shannon entropy $H(g) = -\sum g \log g$ obtains the physical meaning of the entropy
contemplated by Boltzmann only under very limited circumstances. Obviously $\ell\Omega$ or $B(g;f)$
is the more fundamental concept. This point is reflected in the fact that in Shannon and
Weaver (1949) essential use is made not of $H(f)$ but of its derived quantities taking the
form of $B(g;f)$.

The Kullback-Leibler (KL) information number is defined by $I(g;f) = -B(g;f)$.
Contrary to the formal definition of $I(g;f)$ by Kullback (1959) the present derivation of
$B(g;f)$ based on Boltzmann's $\ell\Omega$ clearly explains the difference of the roles played by
$g$ and $f$. The primary distribution $f$ is hypothetical, while the secondary $g$ is
factual. Boltzmann (1878) also arrived at a generalization of the exponential family of
distributions by maximizing the entropy under certain constraints. These results
demonstrate the fundamental contribution of Boltzmann to the science of statistics. A
summary of mathematical properties of the Boltzmann entropy or the Kullback-Leibler
information is given by Csiszar (1975).


6. Entropy and the asymptotic theory of statistics

The  Boltzmann  entropy  appeared,  sometimes  implicitly,  in  many  basic  contributions  to 
statistics,  particularly  in  the  area  of  asymptotic  theory.  For  a  pair  of  distributions 
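$p(\cdot|\theta_1)$ and $p(\cdot|\theta_2)$ from a parametric family $\{p(\cdot|\theta); \theta \in \Theta\}$ the deviation of the
former from the latter can be measured by $B(\theta_1;\theta_2) = B(p(\cdot|\theta_1); p(\cdot|\theta_2))$. This induces a
natural topology in the space of parameters.

When $\theta_1$ and $\theta_2$ are $k$-dimensional parameters given by $\theta_1 = (\theta_{11}, \theta_{12}, \ldots, \theta_{1k})'$ and
$\theta_2 = (\theta_{21}, \theta_{22}, \ldots, \theta_{2k})'$, under appropriate regularity conditions we have

$$ B(\theta_1;\theta_2) = \frac{1}{2} (\theta_2 - \theta_1)' \, E\left[\frac{\partial^2}{\partial\theta' \partial\theta} \log p(x|\theta_1)\right] (\theta_2 - \theta_1) + o(\|\theta_2 - \theta_1\|^2) , $$

where $(\partial^2/\partial\theta'\partial\theta) \log p(x|\theta_1)$ denotes the Hessian evaluated at $\theta = \theta_1$ and $E$ the
expectation with respect to $p(\cdot|\theta_1)$ and $o(\|\theta_2 - \theta_1\|^2)$ a term of order lower than
$\|\theta_2 - \theta_1\|^2 = \sum_i (\theta_{1i} - \theta_{2i})^2$. Obviously $-E(\partial^2/\partial\theta'\partial\theta) \log p(x|\theta_1)$ is the Fisher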
information matrix. The fact that the Fisher information matrix is just minus the
Hessian of the entropy $B(\theta_1;\theta_2)$ with respect to $\theta_2$ at $\theta_2 = \theta_1$ clearly shows that it is
related to the local property of the topology induced by the entropy.
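
The local quadratic behavior can be seen in the simplest one-parameter case. The sketch below is an illustration chosen for this purpose and is not from the paper: it uses the Bernoulli family, for which the Fisher information is 1/(p(1 - p)), and compares B(p1;p2) with the quadratic form -(1/2)(p2 - p1)² I(p1). Function names are hypothetical.

```python
import math


def boltzmann_entropy_bernoulli(p1, p2):
    """B(p1; p2) for Bernoulli distributions with success probabilities p1 and p2."""
    return -(p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2)))


def fisher_information_bernoulli(p):
    """Fisher information of a single Bernoulli observation."""
    return 1.0 / (p * (1.0 - p))


def quadratic_approximation(p1, p2):
    """(1/2)(p2 - p1)^2 E[d^2/dp^2 log p(x|p1)] = -(1/2)(p2 - p1)^2 I(p1)."""
    return -0.5 * (p2 - p1) ** 2 * fisher_information_bernoulli(p1)


ratio_near = boltzmann_entropy_bernoulli(0.3, 0.301) / quadratic_approximation(0.3, 0.301)
ratio_far = boltzmann_entropy_bernoulli(0.3, 0.35) / quadratic_approximation(0.3, 0.35)
```

The ratio approaches 1 as p2 approaches p1, which is the content of the expansion: near the diagonal the entropy behaves like a squared distance whose metric is the Fisher information.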

The likelihood ratio test statistic for testing a specific model, or hypothesis,
defined by $\theta = \theta_0$ is given by

$$ \lambda_n = \frac{\prod_i p(x_i|\theta_0)}{\sup\{\prod_i p(x_i|\theta); \theta \in \Theta\}} , $$

where $(x_1, x_2, \ldots, x_n)$ denotes the sample. If the true distribution is defined by $p(\cdot|\theta)$
we expect that

$$ T_n = -\frac{1}{n} \log \lambda_n $$

will stochastically converge to $-B(\theta;\theta_0)$ as $n$ is increased to infinity. The result of
Bahadur (1967) shows that under certain regularity conditions it holds that

$$ \lim_{n \to \infty} \frac{1}{n} \log P(T_n > t_n | \theta_0) = B(\theta;\theta_0) , $$

where $t_n$ denotes the sample value of the test statistic $T_n$ for a particular
realization $(x_1, x_2, \ldots, x_n)$. This means that if one calculates the probability of the
statistic $T_n$ being larger than $t_n$, assuming that the data has come from the
hypothetical distribution $p(\cdot|\theta_0)$, it will asymptotically be equal to $\exp(nB(\theta;\theta_0))$,
where $\theta$ denotes the true distribution.

In a practical application the hypothesis will never be exact and the above result
says that by calculating the P-value of the log likelihood ratio test we are actually
measuring the entropy $nB(\theta;\theta_0)$. This observation eliminates the common misconception that
considers the test meaningless due to the certainty of $\theta_0$'s being false.

The concept of second order efficiency was introduced by Rao (1961). In that paper he
discussed the performance of an estimator obtained by minimizing the Kullback-Leibler
information number $\sum_r \pi_r \log(\pi_r/p_r)$, where $\pi_r$ denotes the probability of the $r$th cell in
a multinomial distribution, defined as a function of a parameter $\theta$, and $p_r$ the observed
relative frequency. This estimator can also be characterized as the one that maximizes
$B(\pi;p)$, while the maximum likelihood estimate maximizes $B(p;\pi)$.

If we carefully follow the derivation of $B(g;f)$ we can see that the primary
distribution $f$ is always hypothetical, while the secondary distribution $g$ is factual.
It is interesting to note that Rao has shown that the minimum KL number estimator, defined
by the entropy with a factual primary distribution and an hypothetical secondary, is less
efficient than the maximum likelihood estimator defined by the more natural definition of
the entropy. A similar relation has been observed between the estimators defined by
minimizing the chi-square and the modified chi-square that are approximations to $-2B(p;\pi)$
and $-2B(\pi;p)$, respectively. These results suggest that the present interpretation of
entropy  can  produce  useful  insights  not  available  from  the  Fisher  information  which  does 
not  discriminate  between  the  primary  and  secondary  distributions. 

The  relation  between  the  entropy  and  the  asymptotic  distribution  of  the  corresponding 
sample  distribution  function  is  discussed  by  Sanov  (1957)  and  Stone  (1974).  Some  standard 
references  on  the  relation  between  the  entropy  and  large  sample  theory  are  Chernoff  (1956) 
and  Rao  (1962). 


7. Likelihood, entropy and the predictive point of view

Obviously,  one  of  the  most  significant  contributions  to  statistics  by  R.  A.  Fisher  is 
the  development  of  the  method  of  maximum  likelihood.  However,  there  is  a  definite 
limitation  to  the  applicability  of  the  idea  of  maximizing  the  likelihood. 

The limitation can most clearly be seen by the following model selection problem.
Consider a set of parametric families $\{p(\cdot|\theta_k)\}$ ($k = 1, 2, \ldots, K$) defined by
$\theta_k = (\theta_{k1}, \theta_{k2}, \ldots, \theta_{kk}, \theta_{0,k+1}, \ldots, \theta_{0K})$. In the $k$th family only the first $k$ components of
the parameter vector $\theta_k$ are allowed to vary but the rest are fixed at some preassigned
values $\theta_{0,k+1}, \ldots, \theta_{0K}$. When data $x$ is given, if we simply maximize the likelihood among
the whole families, we always end up with the choice of $p(\cdot|\hat\theta_K)$, where $\hat\theta_k$ denotes the
maximum likelihood estimate that maximizes $p(x|\theta_k)$. This means that the method of maximum
likelihood always leads to the selection of the most unrestricted parametric model. This
is obviously against our expectation. The counseling by a statistician of the choice of
the highest possible order whenever fitting a polynomial regression, by the method of
maximum likelihood, will certainly lose the trust of his clients.
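
The difficulty is easy to reproduce. The sketch below assumes a simple instance of the nested families, chosen here for illustration only: independent observations x_i distributed as normal with mean θ_i and unit variance, where the kth family frees the first k means and fixes the rest at 0. The data values and function names are hypothetical.

```python
def max_log_likelihood(x, k, fixed=0.0):
    """Maximized log likelihood, up to an additive constant, of the kth family."""
    # The k free means are fitted exactly; only the fixed components leave residuals.
    return -0.5 * sum((xi - fixed) ** 2 for xi in x[k:])


def select_by_maximum_likelihood(x):
    """Return the family index with the largest maximized likelihood, and all the values."""
    K = len(x)
    values = [max_log_likelihood(x, k) for k in range(1, K + 1)]
    best = 1 + max(range(K), key=lambda i: values[i])
    return best, values


chosen, values = select_by_maximum_likelihood([0.3, -1.2, 0.05, 0.8, -0.4])
```

Because each family contains the previous one, the maximized log likelihood can never decrease with k, and the largest family is selected whatever the data look like.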

Fisher  was  clearly  aware  of  the  limitation  of  his  theory  of  estimation.  Pointing  out 
the  future  possibility  of  inductive  argument  which  will  discuss  methods  of  assigning  the 
functional form of the population by data Fisher (1936) states "At present it is only
important to make clear that no such theory has been established". This clearly suggests
the  necessity  of  extending  the  theory  of  statistical  estimation  to  the  situation  where 
several  possible  parametric  models  are  involved.  Such  an  extension  is  possible  with  a 
proper  combination  of  the  predictive  point  of  view  and  the  concept  of  entropy. 

The  predictive  point  of  view  demanded  the  generalization  of  the  concept  of  estimation 
from  that  of  a  parameter  to  that  of  the  distribution  of  a  future  observation.  We  will  call 
such  an  estimate  a  predictive  distribution.  The  basic  criterion  in  this  generalized  theory 
of  estimation  will  then  be  the  measure  of  the  goodness  of  the  predictive  distribution.  The 
expected  deviation  of  the  true  distribution  from  the  predictive  distribution  as  measured  by 
the expected entropy E B(true; predictive) will serve for this purpose. Here, the


of  the  form  of  the  true  distribution  p(y).  This  provides  a  proof  of  the  fact  that  the 
log likelihood is an objective measure of the goodness of fit of the distribution $p(\cdot|\theta)$.
Thus  we  see  that  a  definite  objectivity  is  being  imparted  to  statistical  inference  through 
the  use  of  log  likelihoods.  In  particular,  we  can  see  that  the  range  of  the  validity  of 
the  concept  of  likelihood  is  not  restricted  to  one  particular  parametric  family  of 
distributions.  This  observation  constitutes  the  basis  for  the  solution  of  the  model 
selection  problem  considered  at  the  beginning  of  this  section. 




8.  Model  selection  and  an  information  criterion  (AIC) 

We will first show that our basic criterion, the expected entropy, provides a natural
extension of the mean squared error criterion. The quality of a predictive distribution
$f(y|x)$ is evaluated by the expected negentropy defined by

$$ -E_x B(f; f(\cdot|x)) = E_y \log f(y) - E_x E_y \log f(y|x) , $$

where $f(y)$ denotes the true distribution of $y$ which is assumed to be independent of
$x$ and $E_x$ and $E_y$ denote the expectations with respect to the true distributions of $x$
and $y$, respectively. By Jensen's inequality we have $E_x \log f(y|x) \le \log E_x f(y|x)$ and we
get the additive decomposition

$$ -E_x B(f; f(\cdot|x)) = \{E_y \log f(y) - E_y \log E_x f(y|x)\} + \{E_y \log E_x f(y|x) - E_x E_y \log f(y|x)\} . $$

The term inside the first brackets on the right hand side represents the amount of increase
of the expected negentropy due to the deviation of $f(y)$ from $E_x f(y|x)$. This term
corresponds to the squared bias in the case of ordinary estimation of a parameter. The
term inside the second brackets represents the increase of the expected negentropy due to
the sampling fluctuation of $f(y|x)$ around $E_x f(y|x)$. This quantity corresponds to the
variance. The present result shows why the two different concepts, squared bias and
variance, can meaningfully be added.
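
The decomposition can be evaluated exactly in a small discrete case. The sketch below is an illustrative construction, not the paper's example: y is a Bernoulli observation with success probability p_true, x is a binomial count from n earlier trials, and the predictive probability of y = 1 is taken to be the smoothed estimate (x + 1)/(n + 2), an assumption made here so that no logarithm of zero occurs.

```python
import math


def decomposition(p_true, n):
    """Return (bias_like, variance_like, total) of the expected negentropy."""
    truth = {1: p_true, 0: 1.0 - p_true}
    # Binomial probabilities of the training count x = 0, ..., n.
    weights = [math.comb(n, x) * p_true ** x * (1 - p_true) ** (n - x) for x in range(n + 1)]

    def predictive(y, x):
        q = (x + 1) / (n + 2)
        return q if y == 1 else 1.0 - q

    e_log_true = sum(truth[y] * math.log(truth[y]) for y in (0, 1))
    e_log_mean = sum(
        truth[y] * math.log(sum(w * predictive(y, x) for x, w in enumerate(weights)))
        for y in (0, 1)
    )
    e_mean_log = sum(
        truth[y] * sum(w * math.log(predictive(y, x)) for x, w in enumerate(weights))
        for y in (0, 1)
    )
    return e_log_true - e_log_mean, e_log_mean - e_mean_log, e_log_true - e_mean_log
```

Both terms are nonnegative and add up to the total. In this example the first term reflects the pull of the smoothed estimate toward 1/2, and the second shrinks as n grows, as a variance would.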

Having  observed  that  the  expected  negentropy  provides  a  natural  extension  of  the  mean 
squared  error  criterion  we  recognize  that  the  main  problem  is  the  estimation  of  the  entropy 
or the expected log likelihood $E_y \log f(y|x)$ of the predictive distribution. In the case
of the ANOVA model discussed by Leonard and Ord the F-test was used for the selection of
the model underlying the definition of the final estimate. For the present general model
we consider the use of the log likelihood ratio test. The test statistic for the testing
of $\{p(\cdot|\theta_k)\}$ against $\{p(\cdot|\theta_K)\}$ is defined by

$$ (-2)\{\log p(x|\hat\theta_k) - \log p(x|\hat\theta_K)\} $$

and is tested as a chi-square with the degrees of freedom $K - k$.
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Me  consider  that  the  test  is  developed  to  sake  a  reasonable  choice  between  p(y | 9^) 

* 

and  p(y|8^).  Prom  our  present  point  of  view  this  means  that  the  test  must  be  in  good 

*  * 

correspondence  to  the  choice  by  ( -2 )E^ (log  p(y | 6^ )  -  log  p(y|9g)}.  The  result  of  Held 
(1943)  on  the  asymptotic  behavior  of  the  log  likelihood  ratio  test  shows  that,  when  x  is 
a  vector  of  observations  of  independently  identically  distributed  random  variables  with  the 
likelihood  functions  satisfying  certain  regularity  conditions,  we  have  asymptotically 
Ex[-2{log  p(x|8^)  -  log  pfxIS^)}]  -  19°  -  8°l*  +  (K  -  k)  , 
where  Ex  denotes  the  mean  of  the  limiting  distribution,  I  Ij  the  Euclidean  norm  defined 
by  the  Fisher  information  matrix,  and  8°  denotes  the  value  of  6^  that  maximizes 
Exlog  p(x|9^),  where  Ex  denotes  the  expectation  with  respect  to  the  true  distribution 
p(x|8°). 

Similarly from the analysis of the asymptotic behavior of the maximum likelihood
estimates we have asymptotically

$$E_x[-2E_y\{\log p(y|\hat\theta_k) - \log p(y|\hat\theta_K)\}]
      = \|\theta_k^0 - \theta_K^0\|_I^2 - (K - k) ,$$

where the restricted predictive point of view is adopted and $x$ and $y$ are assumed to be
independently identically distributed.

From these two results it can be seen that as a measurement of
$(-2)E_y\{\log p(y|\hat\theta_k) - \log p(y|\hat\theta_K)\}$ the log likelihood ratio test
statistic $(-2)\{\log p(x|\hat\theta_k) - \log p(x|\hat\theta_K)\}$ shows an upward bias by
the amount of $2(K - k)$. If we correct for this bias then we get
$\{-2 \log p(x|\hat\theta_k) + 2k\} - \{-2 \log p(x|\hat\theta_K) + 2K\}$ as a
measurement of the difference of the entropies of the models specified by
$p(\cdot|\hat\theta_k)$ and $p(\cdot|\hat\theta_K)$. This observation leads to the
conclusion that the statistic $-2 \log p(x|\hat\theta_k) + 2k$ should be used as a measure
of the badness of the model specified by $p(\cdot|\hat\theta_k)$ (Akaike, 1973). The
pseudonym AIC adopted by Akaike (1974) for this statistic is the abbreviation of "an
information criterion" and is symbolically defined by

      AIC = -2 log(maximum likelihood) + 2 (number of parameters) ,

where log denotes natural logarithm.

If the log likelihood ratio test is considered as a measurement of the entropy
difference then the above observation suggests that from our present point of view we
should choose the model with smaller value of AIC. If we follow this idea we get an
estimation procedure which simultaneously realizes the model selection and parameter
estimation. An estimate thus obtained is called a minimum AIC estimate (MAICE). Now it is
a simple matter to see that the critical level 2 of the F test by Leonard and Ord
corresponds to the factor 2 of the second term in the definition of AIC.
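
The definition of AIC and the minimum AIC rule can be illustrated by a minimal sketch. The
setting is an assumption made only for illustration: observations from a normal
distribution with known unit variance, a restricted model with the mean fixed at zero
(k = 0) and an unrestricted model with a free mean (K = 1). The data values and the helper
names `aic` and `gaussian_loglik` are hypothetical.

```python
import math


def aic(max_log_likelihood, n_params):
    """AIC = -2 log(maximum likelihood) + 2 (number of parameters)."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


def gaussian_loglik(data, mean):
    """Log likelihood of independent N(mean, 1) observations."""
    n = len(data)
    residual_ss = sum((v - mean) ** 2 for v in data)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * residual_ss


# Hypothetical observations (illustration only).
data = [0.3, -0.1, 0.4, 0.2, -0.2, 0.1]

# Restricted model: mean fixed at 0, no free parameter.
loglik_fixed = gaussian_loglik(data, 0.0)
aic_fixed = aic(loglik_fixed, 0)

# Unrestricted model: the maximum likelihood estimate of the mean is the sample mean.
mean_hat = sum(data) / len(data)
loglik_free = gaussian_loglik(data, mean_hat)
aic_free = aic(loglik_free, 1)

# Log likelihood ratio statistic and its bias-corrected form.
lr_statistic = -2.0 * (loglik_fixed - loglik_free)
aic_difference = aic_fixed - aic_free  # equals lr_statistic - 2 (K - k)

selected = "fixed mean" if aic_fixed < aic_free else "free mean"
print(lr_statistic, aic_difference, selected)
```

The restricted model is rejected by the minimum AIC rule only when the log likelihood
ratio statistic exceeds 2(K - k), which is where the factor 2 enters.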

One important observation about AIC is that it is defined without specific reference
to the true model $p(\cdot|\theta_K)$. Thus, for any finite number of parametric models, we
may always consider an extended model that will play the role of $p(\cdot|\theta_K)$. This
suggests that AIC can be useful, at least in principle, for the comparison of models which
are non-nested, i.e., the situation where the conventional log likelihood ratio test is not
applicable.

We will demonstrate the practical utility of AIC by its application to the
multidimensional contingency table analysis discussed by Goodman (1971). Observing the
frequency $f_{ijkl}$ in the cell $(i,j,k,l)$ of a 4-way contingency table
($i = 1,2,\ldots,I$; $j = 1,2,\ldots,J$; $k = 1,2,\ldots,K$; $l = 1,2,\ldots,L$) with
$\sum_{ijkl} f_{ijkl} = n$ the model is specified by the parametrization

$$\log F_{ijkl} = \theta + \lambda_i^A + \cdots + \lambda_l^D
      + \lambda_{ij}^{AB} + \cdots + \lambda_{kl}^{CD}
      + \lambda_{ijk}^{ABC} + \cdots + \lambda_{jkl}^{BCD}
      + \lambda_{ijkl}^{ABCD} ,$$

where $F_{ijkl}$ denotes the expected frequency and the $\lambda$'s satisfy the condition
that any sum with respect to one of the suffices is equal to zero. The characters A, B, C, D
symbolically denote the groups of parameters that are related with the factors denoted by
these characters. Hypotheses are defined by putting some of the parameters equal to zero.

Goodman discussed the application to the analysis of detergent user data which
included information on the following four factors: the softness of the water used (S),
the previous use of a brand (U), the temperature of the water used (T) and the preference
of a brand over the other (P). In the following Table 1 the initial portion of Goodman's
Table 3 is shown with the corresponding AIC's. In Goodman's modeling, when a higher
order effect is considered all the corresponding lower order effects are included in the
model.


Table 1. Goodman's analysis of consumer data

              Estimated Group     Degrees of    (-2) Log
Hypothesis    of Parameters       Freedom       Likelihood Ratio     AIC*
--------------------------------------------------------------------------
     1        None                   23            118.63            72.63
     2        S, P, T, U             18             42.93             6.93
     3        All the pairs           9              9.85            -8.15
     4        All the triplets        2              0.74            -3.26
     5        PU, S, T               17             22.35           -11.65
     6        PU, S                  18             95.56            59.56
     7        PU, T                  19             22.85           -15.15
     8        PU, PT                 18             18.49           -17.51
     9        PT, U                  19             39.07             1.07
    10        PU, PT, ST             14             11.89           -16.11
    11        PU, PT, S              16             17.99           -14.01

*AIC = (-2)(Log Likelihood Ratio) - 2(Degrees of Freedom)
     = AIC(i) - AIC($\infty$), where AIC(i) denotes the original AIC of $H_i$
and AIC($\infty$) denotes that of the saturated model with all the
parameters unrestricted.
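
The AIC column of Table 1 follows from the other two numerical columns by the formula in
the footnote. The short sketch below recomputes it and picks the hypothesis with the
minimum AIC. The degrees of freedom and likelihood ratio statistics are those of Table 1;
the variable names are illustrative.

```python
# (hypothesis number, estimated group of parameters,
#  degrees of freedom, (-2) log likelihood ratio), as listed in Table 1.
table1 = [
    (1, "None", 23, 118.63),
    (2, "S, P, T, U", 18, 42.93),
    (3, "All the pairs", 9, 9.85),
    (4, "All the triplets", 2, 0.74),
    (5, "PU, S, T", 17, 22.35),
    (6, "PU, S", 18, 95.56),
    (7, "PU, T", 19, 22.85),
    (8, "PU, PT", 18, 18.49),
    (9, "PT, U", 19, 39.07),
    (10, "PU, PT, ST", 14, 11.89),
    (11, "PU, PT, S", 16, 17.99),
]

# AIC relative to the saturated model:
# (-2)(log likelihood ratio) - 2 (degrees of freedom).
relative_aic = {
    number: round(lr_statistic - 2 * dof, 2)
    for number, _, dof, lr_statistic in table1
}

# Hypothesis with the minimum AIC; the saturated model itself has relative AIC 0.
best = min(relative_aic, key=relative_aic.get)

for number, effects, _, _ in table1:
    print(f"H{number:<3} {effects:<18} {relative_aic[number]:8.2f}")
print("minimum AIC hypothesis:", best)
```

A negative value means that the hypothesis is preferred to the saturated model.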

Goodman asserts that $H_1$ and $H_2$ do not fit the data but $H_3$ and $H_4$ do, where
$H_i$ denotes the hypothesis numbered by $i$. By the present definition of AIC the negative
signs of AIC for $H_3$ and $H_4$ mean that the corresponding models are preferred to the
saturated non-restricted model. This corresponds to Goodman's assertion. The AIC already
suggests that $H_4$ is an over-fit and Goodman actually proceeds to the detailed analysis
of $H_3$ and arrives at $H_5$.

The significances of S and T are then respectively checked by comparing $H_6$ and
$H_7$ with $H_5$. The hypothesis $H_8$ is then judged to be an improvement over $H_7$. The
effect of PU is then confirmed by comparing $H_8$ with $H_9$. Further elaboration of $H_8$
leads to $H_{10}$. However, its improvement over $H_8$ is not considered to be significant,
although the effect ST is judged to be significant by the comparison of $H_{10}$ with
$H_{11}$. The path of Goodman's stepwise search is schematically represented by Table 2.

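Table 2. The path of Goodman's stepwise search and the corresponding AIC's*

     72.6       6.9       -8.2       -3.3         0
     H_1        H_2       H_3        H_4        H_∞
     none     singles    pairs     triplets   saturated

[Diagram of the remaining branch of the search, through $H_5$ to $H_{11}$; the layout is
not recoverable. Surviving labels: 59.6 (PU, S); -11.7 (PU, S, T); (PT, U);
$H_{11}$ (PU, PT, S). The AIC's of these hypotheses are listed in Table 1.]

*The number above each hypothesis denotes the AIC relative to that of $H_\infty$.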

Table 2 shows that we come to one and the same conclusions as those obtained by
Goodman with the choice of 5% as the critical level, simply by choosing models with lower
values of AIC. The fact that AIC does not require the table look-up of the chi-squares
with different degrees of freedom adds to the significance of this result. Since AIC is
defined with a unique scaling unit it allows easy extraction of useful information from a
collection of fitted models. For example, by comparing the difference of AIC's of $H_7$
and $H_5$ with that of $H_8$ and $H_{11}$, we can clearly see the deteriorating effect of
including S into the model. Also the direct comparison of $H_6$ and $H_7$, not possible by
the log likelihood ratio test, is now possible by AIC and the inferiority of $H_6$ that
contains S is clearly recognizable. The ability of AIC to allow the researcher to
extract global information from the result of fitting a large number of models is a unique
characteristic that is not shared by the conventional model selection procedure realized by
some ad hoc application of significance tests.

AIC attracted much attention from people in both theoretical and applied fields of
statistics. In particular the 1974 paper (Akaike, 1974) has been spotted by the Institute
for Scientific Information as one of the most frequently cited papers in the area of
engineering, technology and applied sciences, with the frequency of citations over 180
during 1974-81 (Akaike, 1981). Some of the theoretical works related with AIC are the
discussion of the asymptotic equivalence of the minimum AIC procedure to cross-validation
by M. Stone (1977b), modifications of the criterion by Schwarz (1978) and Hannan and Quinn
(1979), discussions of the relation to the Bayes procedure by Zellner (1978), Atkinson
(1980) and Smith and Spiegelhalter (1980) and discussions of the optimality of the MAICE
type procedure by Akaike (1978a), Shibata (1980) and C. J. Stone (1982). The inherent
relation between the magic number 2 and the predictive point of view can be seen also by
the works by Geisser and Eddy (1979) and Leonard (1977).

When the number of possible alternatives is increased the MAICE procedure may tend to
be sensitive to sampling fluctuations. One solution to this problem is to use some
averaging procedure as is discussed in Akaike (1979). However, this brings us closer to
Bayesian modeling which is going to be discussed in the next section.

9. Entropy maximization principle and the Bayes procedure

The discussion of the concept of true model and its relation to entropy clearly shows
that there is no end in statistical model building. All we can do is to produce better
models. When we admit this then it is easy to accept the following very modest, yet very
productive, view of statistics: all statistical activities are directed to maximize the
expected entropy of the predictive distribution in each particular application. We call
this the entropy maximization principle (Akaike, 1977). The minimum AIC procedure may be
considered as a realization of this principle. The generality of this principle can be
seen by the following discussion of Bayesian modeling.

Consider the set of models given by $\{g_k(\cdot); k = 1,2,\ldots,K\}$, where $g_k(y)$
denotes a predictive distribution specified by the parameter $k$. Assume that we consider
the use of a random mechanism for the selection of the predictive distribution. Our
preference of the models is represented by the distribution of probabilities $w_k(x)$ of
selecting the $k$th model, where $w_k(x)$ is specified by combining our knowledge of the
problem and the data $x$. However, irrespectively of the form of the true distribution of
$y$, the following relation holds

$$E_y \log\Bigl\{\sum_{k=1}^{K} g_k(y) w_k(x)\Bigr\} \ge \sum_{k=1}^{K} w_k(x) E_y \log g_k(y) ,$$

where $E_y$ denotes the expectation with respect to the true distribution of $y$. This means
that the entropy of the true distribution with respect to the averaged distribution
$\sum_k g_k(y) w_k(x)$ is always greater than or equal to that with respect to the
distribution chosen by the random mechanism. The entropy maximization principle suggests
that we should consider the use of the averaged distribution $\sum_k g_k(y) w_k(x)$ as our
predictive distribution rather than the distribution to be chosen by the random mechanism.
Taking into account the fact that the conventional model selection procedure is realized by
a particular choice of $w_k(x)$ which takes either the value 0 or 1, the present result
suggests the possibility of improved modeling by extending the basic set of models from
$\{g_k(\cdot); k = 1,2,\ldots,K\}$ to
$\{\sum_k w_k g_k(\cdot); w_k \ge 0, \sum_k w_k = 1\}$.
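
The averaging inequality of the entropy maximization principle,
$E_y \log \sum_k g_k(y) w_k(x) \ge \sum_k w_k(x) E_y \log g_k(y)$, can be checked on a
small discrete example. This is a sketch only: the true distribution, the three candidate
predictive distributions and the weights are hypothetical numbers chosen for illustration.

```python
import math

# Hypothetical true distribution of y on three outcomes.
true_dist = [0.2, 0.5, 0.3]

# Hypothetical predictive distributions g_k(y), k = 1, 2, 3.
models = [
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
]

# Hypothetical selection probabilities w_k(x) for the observed data x.
weights = [0.5, 0.3, 0.2]


def expected_log(predictive):
    """E_y log g(y) under the true distribution of y."""
    return sum(p * math.log(q) for p, q in zip(true_dist, predictive))


# Averaged predictive distribution sum_k g_k(y) w_k(x).
averaged = [
    sum(w * g[j] for w, g in zip(weights, models))
    for j in range(len(true_dist))
]

# Left side: expected log likelihood of the averaged distribution.
score_averaged = expected_log(averaged)

# Right side: expected log likelihood when one model is drawn at random
# according to the weights.
score_random_choice = sum(w * expected_log(g) for w, g in zip(weights, models))

print(score_averaged, score_random_choice)
```

The averaged distribution never scores lower than the random selection, whatever the
true distribution is, because the logarithm is concave.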


The problem now is how to define $w_k(x)$. Since the distribution $w_k(x)$, which we
will call the inferential distribution, is introduced to define a predictive distribution
we will consider the more general problem of the selection of a predictive distribution.
Assume that the variable $x$ takes a finite number of discrete values $x = 1,2,\ldots,I$.
Before the observation of $x$ we consider the selection of the predictive distribution of
$x$. We assume that the possible predictive distributions of $x$ are also parametrized as
$f_k(x)$. Since $x$ is not available yet we consider the use of a probability distribution
$w_k$ over $k$, defined independently of $x$. Thus we are specifying a probability
distribution $w_k f_k(x)$ over $(k,x)$.

When the observation produces $x = x_0$ a Bayesian will say that we should follow the
Bayes procedure and replace the distribution $w_k f_k(x)$ by the distribution $w(k,x)$
which is defined by

$$w(k,x) = \begin{cases}
      \dfrac{w_k f_k(x_0)}{\sum_k w_k f_k(x_0)} & \text{for } x = x_0 , \\
      0 & \text{otherwise} .
\end{cases}$$

However, this suggestion is not based on any clearly defined principle. One could have
chosen any $w(k,x)$ as a function of $x_0$, if only the distribution is limited to
$x = x_0$.

There is an essential analogy between Boltzmann's derivation of the exponential
family of distributions for energy and the Bayes procedure. To see this we consider more
generally an arbitrary distribution $\pi(k,x)$ over $(k,x)$ and try to find a distribution
$w(k,x)$ concentrated on $\{(k,x_0)\}$ and such that the Boltzmann entropy with respect to
the original $\pi(k,x)$ is maximum. This leads to the maximization of

$$\sum_x \sum_k w(k,x)\{\log \pi(k,x) - \log w(k,x)\}
      + \lambda\Bigl(\sum_k w(k,x_0) - 1\Bigr) ,$$

where $\lambda$ is the Lagrange multiplier. The solution is given by

$$w(k,x) = \begin{cases}
      \dfrac{\pi(k,x_0)}{\sum_k \pi(k,x_0)} & \text{for } x = x_0 , \\
      0 & \text{otherwise} .
\end{cases}$$
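
The maximizing property of the conditional distribution can be checked numerically. The
sketch below assumes a hypothetical joint distribution `pi` over three values of k and two
values of x, and compares the Boltzmann entropy of the conditional distribution with that
of randomly generated alternatives concentrated on the observed value. All numbers and
names are illustrative.

```python
import math
import random

# Hypothetical original distribution pi(k, x), k in {0, 1, 2}, x in {0, 1}.
pi = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.25, (1, 1): 0.05,
    (2, 0): 0.15, (2, 1): 0.25,
}
x0 = 1  # observed value of x


def boltzmann_entropy(w):
    """Entropy of w with respect to pi, for w concentrated on x = x0.

    w maps k to w(k, x0); terms with zero probability contribute nothing.
    """
    return sum(
        wk * (math.log(pi[(k, x0)]) - math.log(wk))
        for k, wk in w.items()
        if wk > 0.0
    )


# Conditional distribution pi(k, x0) / sum_k pi(k, x0).
norm = sum(pi[(k, x0)] for k in range(3))
conditional = {k: pi[(k, x0)] / norm for k in range(3)}
best = boltzmann_entropy(conditional)

# Random alternative distributions over k, all concentrated on x = x0.
random.seed(0)
alternatives = []
for _ in range(1000):
    raw = [random.random() for _ in range(3)]
    total = sum(raw)
    alternatives.append({k: r / total for k, r in enumerate(raw)})

max_alternative = max(boltzmann_entropy(w) for w in alternatives)
print(best, max_alternative)
```

The entropy attained by the conditional distribution equals the logarithm of the
normalizing constant, and no alternative exceeds it.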

This result characterizes the transition from the original distribution to the
conditional distribution as the most conservative action that conforms to the observation
of the data $x_0$ yet otherwise maximally retains the structure of the originally assumed
distribution. We will call this particular application of the maximum entropy method of
probability distribution generation the conditioning principle.

Coming back to Bayesian modeling we can now see that the assumption of the original
distribution $\pi(k,x)$ and the conditioning principle lead to the use of the "posterior
distribution" $w(k,x)$ as the inferential distribution $w_k(x)$. That such a definition of
the inferential distribution is a reasonable one can be shown as follows. First we assume
that when $k$ is given $y$ and $x$ are independent and the distribution is given by
$g_k(y) f_k(x)$. The expected performance of a predictive distribution $h(y|x)$ is then
evaluated by $E_k E_{x|k} E_{y|k} \log h(y|x)$, where $E_k$ denotes the expectation with
respect to the distribution $w_k$ and $E_{x|k}$ and $E_{y|k}$ denote the expectations with
respect to $f_k(x)$ and $g_k(y)$, respectively. We have

$$E_k E_{x|k} E_{y|k} \log h(y|x) = \sum_x f(x) \sum_y \sum_k g_k(y) w(k|x) \log h(y|x) ,$$

where $f(x) = \sum_k f_k(x) w_k$ and $w(k|x) = f_k(x) w_k / f(x)$. This quantity is
maximized by putting

$$h(y|x) = \sum_k g_k(y) w(k|x) ,$$

which means that, as long as we assume the validity of the original probabilistic set-up,
the use of the posterior distribution $w(k|x)$ as the inferential distribution is the best
choice. This result was recognized earlier by Kerridge (1961) and Aitchison (1975).
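
The optimality of the posterior-weighted predictive distribution can be illustrated on a
two-model example. The following sketch uses hypothetical values for the prior $w_k$ and
the distributions $f_k(x)$ and $g_k(y)$, and compares the predictive distribution
$\sum_k g_k(y) w(k|x)$ with a plug-in rule that uses only the model of highest posterior
probability. The plug-in rule is introduced here only as a comparison.

```python
import math

# Hypothetical prior w_k over two models.
prior = [0.5, 0.5]

# Hypothetical data distributions f_k(x) on x in {0, 1}.
f = [[0.8, 0.2], [0.3, 0.7]]

# Predictive distributions g_k(y) on y in {0, 1}; here g_k = f_k is assumed.
g = [[0.8, 0.2], [0.3, 0.7]]

models = range(2)
outcomes = range(2)


def marginal(x):
    """f(x) = sum_k f_k(x) w_k."""
    return sum(f[k][x] * prior[k] for k in models)


def posterior(k, x):
    """w(k|x) = f_k(x) w_k / f(x)."""
    return f[k][x] * prior[k] / marginal(x)


def expected_score(h):
    """E_k E_{x|k} E_{y|k} log h(y|x), with h indexed as h[x][y]."""
    return sum(
        prior[k] * f[k][x] * g[k][y] * math.log(h[x][y])
        for k in models
        for x in outcomes
        for y in outcomes
    )


# Posterior-weighted predictive distribution h(y|x) = sum_k g_k(y) w(k|x).
bayes = [
    [sum(g[k][y] * posterior(k, x) for k in models) for y in outcomes]
    for x in outcomes
]

# Comparison rule: use g_k of the single model with the largest posterior.
plug_in = [g[max(models, key=lambda k: posterior(k, x))] for x in outcomes]

score_bayes = expected_score(bayes)
score_plug_in = expected_score(plug_in)
print(score_bayes, score_plug_in)
```

Selecting a single model corresponds to weights restricted to 0 or 1, so its expected
log likelihood cannot exceed that of the posterior mixture.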
10. Statistical inference and Bayesian modeling

What the result of the preceding section has shown is that the conditioning principle
leads to the best choice of the inferential distribution under the assumption of the
validity of the Bayesian model defined by $g_k(y) f_k(x) w_k$. What would happen when we
are uncertain about the choice of the "prior distribution" $w_k$?

Here we recall our basic observation that statistical model building is an unending
process. This means that the validity of a model can only be established by a careful
analysis of other possibilities. This leads to the situation where we have several
alternative prior distributions $w_k^{(i)}$ ($i = 1,2,\ldots,I$). Here we have to assume a
(hyper) prior distribution $\pi(i)$ over $i$'s. When the data $x$ is observed the posterior
probability $p(i|x)$ of the $i$th model is given by the relation

$$p(i|x) \propto f^{(i)}(x) \pi(i) ,$$

where $f^{(i)}(x)$ is the likelihood of the $i$th Bayesian model defined by

$$f^{(i)}(x) = \sum_k f_k(x) w_k^{(i)} .$$

Thus, even when we do not know how to specify $\pi(i)$, we can see how much relative
support was given to each model by the observation $x$. Good (1965) called the procedure of
hyperparameter estimation by maximizing the likelihood of a Bayesian model the type II
maximum likelihood procedure. The use of the likelihood for the assessment of a Bayesian
model is demonstrated in an illuminating paper by Box (1980). The application to the very
practical problem of seasonal adjustment is discussed by the present author (Akaike,
1980a).
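
The likelihood of a Bayesian model, $f^{(i)}(x) = \sum_k f_k(x) w_k^{(i)}$, is simple to
compute in the finite discrete case. The sketch below compares three hypothetical prior
distributions under a uniform hyperprior; the distributions, the observed value and the
names are assumptions made for illustration only.

```python
# Hypothetical data distributions f_k(x) for two models, x in {0, 1}.
f = [[0.8, 0.2], [0.3, 0.7]]

# Three hypothetical alternative prior distributions w_k^{(i)} over the models.
priors = [
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
]

# Uniform hyperprior pi(i), assumed here for lack of further information.
hyperprior = [1.0 / len(priors)] * len(priors)

x_observed = 1  # hypothetical observation


def model_likelihood(i, x):
    """Likelihood f^{(i)}(x) = sum_k f_k(x) w_k^{(i)} of the i-th Bayesian model."""
    return sum(f[k][x] * priors[i][k] for k in range(len(f)))


likelihoods = [model_likelihood(i, x_observed) for i in range(len(priors))]

# Posterior probabilities p(i|x), proportional to f^{(i)}(x) pi(i).
unnormalized = [lik * h for lik, h in zip(likelihoods, hyperprior)]
total = sum(unnormalized)
posterior_over_priors = [u / total for u in unnormalized]

# Type II maximum likelihood choice: the prior with the largest likelihood.
type2_choice = max(range(len(priors)), key=lambda i: likelihoods[i])

print(likelihoods, posterior_over_priors, type2_choice)
```

Even without the hyperprior, the list of likelihoods shows the relative support given to
each prior by the observation.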

The discussion of Bayesian modeling will never be complete unless we provide a
procedure for the modeling of the situation where no further prior information is available
for the modeling. The concept of entropy again finds an interesting application in this
type of situation. It has been shown that the well-known Jeffreys' ignorance prior
distribution (Jeffreys, 1946) can be given an interpretation as the locally or globally
impartial prior distribution (Akaike, 1978b).


However, this concept is essentially dependent on the continuity of the parameter
involved. Recently the present author applied the predictive point of view and the concept
of entropy to define a prior distribution that "lets the data speak most". For the
Bayesian model discussed in the preceding section this prior distribution, called the
minimum information prior distribution, is defined as the one that maximizes

$$I(w) = \sum_y \sum_x h(y,x) \log \frac{h(y,x)}{g(y) f(x)} ,$$

where $g(y) = \sum_k g_k(y) w_k$, $f(x) = \sum_k f_k(x) w_k$ and
$h(y,x) = \sum_k g_k(y) f_k(x) w_k$. The strict predictive point of view demands us to put
$g_k(y) = f_k(y)$. It has been observed that this definition leads to interesting
non-trivial specifications of the prior distribution over a finite discrete set of
alternatives (Akaike, 1982).

Related  works  in  this  area  are  those  by  Zellner  (1977)  and  Bernardo  (1979)  based  on 
the  earlier  work  of  Lindley  (1956)  who  discussed  the  use  of  the  Shannon  entropy  in 
statistics. 

Do these formal procedures of generating prior distributions produce useful results?
The answer can be obtained only through the detailed analysis of the final output of each
Bayesian model thus obtained. An example of such an analysis is given by Akaike (1980b)
where admissibility is proved for the James-Stein type estimator of a multivariate normal
distribution obtained by applying the ignorance prior to the hyperparameter of a prior
distribution.

Here  again  we  are  reminded  of  the  attitude  of  Boltzmann  who  considered  the 
justification  of  the  primary  distribution  used  in  the  derivation  of  the  distribution  of  the 
energy  could  only  be  obtained  through  the  observation  of  the  validity  of  the  final  result. 
The  use  of  a  Bayesian  procedure  can  only  be  justified  when  the  procedure  produces  good 
results  for  those  data  which  are  "similar"  to  the  present  one  and  for  which  unequivocal 
judgment  of  the  results  is  possible. 


11.  Conclusion 


It is now clear that the predictive point of view, particularly in its strict form,
and the concept of entropy can produce a unifying view of statistics. This view is not
only conceptually simple and unifying but also practically very productive. The notorious
difficulty of the significance test of multiple hypotheses is given a practical solution by
AIC. The historical split between the Bayesian and non-Bayesian is now eliminated.

The  entropy  maximization  principle  which  is  obtained  by  combining  the  predictive  point 
of  view  with  the  concept  of  entropy  clearly  states  that  the  search  for  better  models  is  the 
purpose  of  statistical  data  analysis.  Bayesian  modeling  will  often  be  useful  in  improving 
the  presently  existing  non-Bayesian  models.  However,  models  are  formulations  of  our  past 
experiences  and  only  new  interesting  real  problems  can  stimulate  the  development  of  useful 
models.  The  fundamental  contribution  by  Boltzmann  came  from  the  deep  study  of  one 
particular  real  problem.  Thus  we  can  see  that  for  the  development  of  statistics  the  main 
emphasis  should  be  placed  on  the  search  for  important  practical  problems.  This  forms  the 
conclusion  of  the  present  paper. 